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Abstract 

A theorem of McCann [13] shows that for any two absolutely continuous probability measures 
on R d there exists a monotone transformation sending one probability measure to the other. A 
consequence of this theorem, relevant to statistics, is that density estimation can be recast in terms 
of transformations. In particular, one can fix any absolutely continuous probability measure, call 
it P, and then reparameterize the whole class of absolutely continuous probability measures as 
monotone transformations from P. In this paper we utilize this reparameterization of densities, as 
monotone transformations from some P, to construct semiparametric and nonparametric density 
estimates. We focus our attention on classes of transformations, developed in the image processing 
and computational anatomy literature, which are smooth, invertible and which have attractive 
computational properties. The techniques developed for this class of transformations allow us 
to show that a penalized maximum likelihood estimate (PMLE) of a smooth transformation 
from P exists and has a finite dimensional characterization, similar to those results found in the 
spline literature. These results are derived utilizing an Euler-Lagrange characterization of the 
PMLE which also establishes a surprising connection to a generalization of Stein's lemma for 
characterizing the normal distribution. 

Key words and phrases: Density estimation, Euler-Lagrange, penalized maximum likelihood, diffeo- 
morphism. 
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1 Introduction 



A theorem of McCann [13] shows that for any two absolutely continuous probability measures on 
M d there exists a monotone transformation sending one probability measure to the other. A conse- 
quence of this theorem, relevant to statistics, is that density estimation can be recast in terms of 
transformations. In particular, one can fix any absolutely continuous probability measure, call it P, 
and then reparameterize the whole class of absolutely continuous probability measures as monotone 
transformations from P. The advantage of this new viewpoint is the flexibility in choosing the target 
measure P which can allow prior information on the shape of true sampling distribution. For exam- 
ple, if it is known that the data is nearly Gaussian then choosing a Gaussian P along with a strong 
penalty on transformations that are far from the identity allows one to construct penalized maximum 
likelihood estimates which effectively shrink the resulting nonparametric estimate in the direction of 
the Gaussian target P. Moreover, when there is no knowledge about the true sampling measure, one 
can simply choose any absolutely continuous P and still construct a completely nonparametric density 
estimate. 

In this paper we utilize this reparameterization of densities, as monotone transformations from 
some P, to construct semiparametric and nonparametric density estimates. We focus our attention 
on classes transformations which are smooth, invertible and which have attractive computational 
properties. Formally, we model our data Xi,...,X n as being generated by some diffeomorphism 

R d — >• W 1 of some fixed but known probably distribution P on R d : 

X 1 ,...,X n ~ Po0. (1) 

The notation P o is taken to mean that the probability of X G A is given by F((j)(A)) where <fi(A) = 
{(j){x) : x G A}. An important observation is that the model ([I]) implies that (j>(X) ~ P. Therefore, 
one can imagine estimating (j) by attempting to "deform" the data X\ , . . . , X n by a transformation 
which satisfies 

4>(X 1 ),...,<j>(X n ) i ^¥. 

One of the main difficulties when working with such a model is the invertibility condition on the maps 
cj). The nonlinearity of this condition in M d when d > 1 makes constructing rich classes and optimizing 
over such classes difficult. One of the early attempts at circumventing such difficulties, found in [2], 
utilized the class of quasi-conformal maps to generate penalized maximum likelihood estimates of <j). 
However, these tools were only developed for R 2 with no clear generalization for higher dimension. 
In this paper we adapt the powerful tools developed by Grenander, Miller, Younes, Trouve and co- 
authors in the image processing and computational anatomy literature (see [24] and the references 
therein) and apply them to the estimation of (j) from data X x , . . . , X n . Indeed, these techniques allow 
us to show that a penalized maximum likelihood estimate of <p exists and that the solution has a finite 
dimensional characterization, similar to those results found in the spline literature (see [23]). 

We start the paper in Section [2] with an overview of using the dynamics of time varying vector field 
flows to generate rich classes of diffeomorphisms. Then in sections [3] and [4] we define our penalized 
maximum likelihood estimate (PMLE) of (j) and prove not only existence, but also establish a finite 
dimensional characterization which is key for numerically computing the resulting density estimate. 
In Section [5] we notice a surprising connection with the Euler-Lagrange equation for the PMLE of (j) 
and a generalization of Stein's lemma for characterizing the normal distribution (see [E]). In sections 
[6] and [7] we give examples of our new density estimate, first as a nonparametric density estimate and 
second as a semiparametric density estimate where a finite dimensional model is used for the target 
probability measure P. We finish the paper with an appendix which contains some technical details 
used for the proofs of the existence of the PMLE and for the finite dimensional characterization. 
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2 A rich class of diffeomorphisms 



In this section we give a brief overview of the technique of using the dynamics of time varying vector 
field flows to generate rich classes of diffeomorphisms. These time varying flows have been utilized 
with spectacular success in the fields of computational anatomy and image processing (see [TJ [4j [5j El 
d EHB HH H3l HSl HEl I2ni ISIl , and references therein). The tools developed in this body of 

work have been shown to be very powerful for developing algorithms which optimize a diverse range of 
objective functions defined over classes of diffeomorphism. It is these tools that we adapt for density 
estimation and statistics. 

A map (p: Q — >• R d is said to be a C k (Q, R d ) diffeomorphism of the open set Q C R d if is one-to- 
one, maps onto £7, and </>, </> -1 G C k (Q,,M d ). In what follows we generate classes of diffeomorphisms 
by time varying vector field flows. In particular, let {vt}te[o,i] be a time varying vector field in R d , 
where t denotes 'time' so that for each t, v t is a function mapping Q into R d . Under mild smoothness 
conditions there exists a unique class of diffeomorphisms of fi, denoted {0t}t€[o,i]> which satisfy the 
following ordinary differential equation 

d t r t (x) = v t (r t (x)) (2) 

with boundary condition <^q(x) = x, for all x G Q (see Theorem [l] below). The interpretation 
of these flows is that 4>t(x) represents the position of a particle at time t, which originated from 
location x at time t = 0, and flowed according the the instantaneous velocity given by v t . It will be 
convenient to consider the diffeomorphism that maps time t to some other time s, this will be denoted 

4>Ux) = 4>l{4>r\x)). 

For the remainder of the paper we will assume that at each time t, v t will be a member of a Hilbert 
space of vector fields mapping Q into R d with inner product denoted by (•, -)y and norm by || • \\y. 
Indeed, how one chooses the Hilbert space V will determine the smoothness properties of the resulting 
class of deformations {4>t}te[o,i]- Once the Hilbert space V is fixed we can define the following set of 
time varying vector fields. 

Definition 1. Let V^- ^ denote the space of measurable functions v t (x): [0,1] x Q — » R d such that 
v t G V for all t G [0, 1] and J* \\v t \\ydt < oo. 

One clear advantage of this class is that it can be endowed with a Hilbert space inner product 
if V is a Hilbert space. Indeed, is a Hilbert space with inner product defined by (v,h) V [o,i] = 

J^ivt, ht)ydt (see Proposition [l] in the Appendix or Proposition 8.17 in [24 ). For the remainder of the 
paper we typically use v or w to denote elements of V^ ' 1 ] and v t or w t to denote the corresponding 
elements of V at any fixed time t. An important theorem found in [24] relates the smoothness of V 
to the smoothness of the resulting diffeomorphism. Before we state the theorem, some definitions will 
be prudent. The Hilbert space V is said to be continuously embedded in another normed space H 
(denoted V ^ H) if V C H and there exists a constant c such that 

HI* <4v\W 

for all v G V where || • \\h denotes the norm in H. Also we let Co(^,R d ) denote the subset of 
C fe (^,R d ) functions whose partial derivatives of order k or less all have continuous extensions to zero 
at the boundary dCl. 

Theorem 1 ([24], [ID]). If V <—> Co(^,R d ) ; then for any v G V^ ^ there exists a unique class of 
C k (£} : M d ) diffeomorphisms {<fit}te[o,i] which satisfy ^ and (/)q(x) = x for all x G ft. 

To derive our finite dimensional characterization we will make the additional assumption that V is 
a reproducing kernel Hilbert space of vector fields. This will guarantee the existence of a reproducing 
kernel K(x, y): ftxft ^ W lXn which can be used to compute the evaluation functional. In particular, 
K has the property that for any xGO and / G V the following identity holds (K(x, f)y = f(x) -p 
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for all p G Mr. To simplify the following computations we will only work with kernels of the form 
K(x,y) = R(x,y)Idxd where i?:l]xl]^iisa positive definite function and Idxd is the d-hy-d 
identity matrix. 

Now the class of diffeomorphisms we consider in this paper corresponds to the set of all time 
varying vector field flows evaluated at t = 1: <j>\, where v ranges through V^. The class will 
be completely specified by the reproducing kernel R(x,y)Idxd which has the flexibility to control the 
smoothness of the resulting maps <p\ through Theorem [T] 



3 Penalized maximum likelihood estimation 

In this section we construct a penalized maximum likelihood estimate (PMLE) of <p\ given the data 
Xi, . . . , X n l ~ P o §\ where {4>t}te[o,i] satisfies ^ and v G yt ' 1 ]. Under mild assumptions on V and 
the density of P we prove the existence of a PMLE estimate of £>, whereby obtaining an estimate Po^J 
of the true sampling distribution. 

The target probability measure P is assumed known with a bounded density with respect to 
Lebesque measure on R d . Therefore, by writing the density of P as expH for some function H : R d — » 
R U { — oo}, the probability measure P o <p\ has density given by 

d¥o(/>l(x)= det(D</)X(x)) exptf o (j>\{x)dx 

where det(D(j)\(x)) is defined as the determinant of the Jacobian of <\>\ evaluated at x G ft (always 
positive by the orientation preserving nature of <j)\). Since <j)\ ranges over an infinite dimensional 
space of diffeomorphisms, the likelihood for v given the data will typically be unbounded as v ranges 
in yi ' 1 ]. The natural solution is to regularize the log likelihood using the corresponding Hilbert space 
norm on V^ ^ with a multiplicative tuning factor A/2. The penalized log-likelihood (scaled by 1/n) 

for the unknown vector field v flow given data X\ , . . . , X n ~ P o $\ is then given by 

1 n X f 1 

E ^ v ) = ~ Vlogdet[D0(X fe )] +Ho0(X k ) - - / \\v t \\ 2 v dt. (3) 

n k=i 1 Jo 

The estimated vector field v is chosen to be any element of V^ ' 1 ] which maximizes E\ over V^ ' 1 '. 
The following theorem establishes that such a v exists. 

Claim 1. Let V be a Hilbert space which is continuously embedded in Co(^,R d ) where Q is a bounded 
open subset ofR d . Suppose e H ^ is a bounded and continuous density on Q. Then there exists a time 
varying vector field v G yt ' 1 ] such that 

E x (v)= sup E x (v). (4) 
Proof. We first establish sup vG y[o,i] E\(v) < oo by splitting the energy E\ into three parts 

n n \ pi 

^AW = -E l0 g det ^l( X ^ + -E^°^( X ^-9 / W V tWv dt - ( 5 ) 

k = l k = l JU , 

\ v / V, v / v 

:=E!(v) :=E 2 (v) :=E ^ 

Notice that each term is well defined and finite whenever v G Vt ' 1 ], since the assumption V ^ 
Co(^,R d ) is sufficient for Theorem 8.7 in [24] to apply to the class yt ' 1 ]. In particular, for any 
v G V^ 0,1 } there exists a unique class of C 1 diffeomorphisms of {(f>t}te[o,i], which satisfies ^ (also 
see Theorem 2.5 in [10]). The term E2(v) is clearly bounded from above since sup xeQ H(x) < oo by 
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assumption. For the remaining two terms notice that the determinant of the Jacobian is given by 
log det D<fi\ (x) = f Q div Vt((f>t(x))dt (by equation (26) in the Appendix). Therefore 



(d^v t (^(X k ))-^\\v t f v yt 

1 n f 1 ( X \ 

-J2 [ sup \div v t (x)\ - -\\v t \\v)dt 
n k=1 Jo \xen z J 

f 

Jo 



< I \c\\v t \\ v 



dt, by the assumption V ^ Co(fi,M d ) 



Now let v 1 , v 2 , . 



be any maximizing sequence that satisfies lim m ^ 00 E(v m ) = sup vG y[ ,i] E\(v). 
Since sup vG y[ ,i] E\(v) < oo we can construct the sequence v m so that there exists an M < oo 
such that ||v m || y[ o,i] < M for all ra. Since ft is bounded, closed finite balls in V^ ^ = L 2 ([0, 1],V) 
are weakly compact (by [10 ). Therefore we may extract a subsequence from v m (relabeled by m) 
which weakly converges to a v G V^ ' 1 '. In particular, (v m , w) V [o,i] — >• (v,w) V [o,i] for all w G V^. 
Furthermore we have lower semicontinuity of the norm 



liniinflKII^ > HfiH^o,,,. 

m— )-oo 



(6) 



Now by Theorem 3.1 in [10 we have that (j)"™ (x) — >• 4>t(x) uniformly in t G [0, 1] as m — >• oo. This 
allows us to show that log det Dcj) 1 ^ (x) r ™^>° log det Dcj)\ (x) for every xGO. To see why, one can use 
similar reasoning as in [7 j. First write 



| log det D(j)\ (x) - log det D<j>\ (x 
where the first term / satisfies 
I 



Jo 



{x))-d\Yv t {(t>t{x))dt 



I + 11 



< 



< 



(\ivv?^r{x))-divv?{4{x))dt 
Jo 

f WAivvTh^f {x) - 4{x)\dt 
Jo 

Jo 



bf(x) - <fi{x)\dt, since V C 2 (ft,M d ) 



< cWv 11 



|y[o,i] 



uf{x)-4{x)\ dt 



1/2 



by Holder. 



= o(l) uniformly in t 

0, since ||v m || y [o,i] < M for all ra. 



For the second term II notice that the map sending v J Q div v t (yt)dt is a bounded linear functional 
on V^ 0,1 } (using the fact that V ^ Co(£l,lR d )) where y t = ffl (x). By the Riesz representation theorem 
there exists a/G yt ' 1 ] such that J* div v t (yt)dt = (v, w v ) v [o,i] • Therefore 



II 



divv t m (0t 0*0) - diYv t {^ t {x))dt 



(v 11 



— >• 0, by weak convergence. 
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Combining the results for / and II we can conclude that log det D^l™ (x) 77 ^>° l g det D(j)\ (x) for 
every xeO. 

To finish the proof notice that 

sup E x (v) = lim E(v m ) = limsup^(v m ) 

1 n 1 n x r l 



< 

n 



n n \ r l 

- V log det£>$ (X fc ) + - V Jf o </>* (X k ) - - lim inf / 
-^logdetD^(X fc ) + -^ J ffo^(X fe )-- / ||«t||^dt, by© 

Z~_1 Z~_1 17 u 



A:=l fc=l 



□ 



One of the important facts about any vector field flow v G V^ ' 1 ] which maximizes E\ is that 
the resulting estimated transformation <j)\ is a geodesic (or minimum energy) flow with respect to 
the vector field norm J Q H^Hy^- To see this is first notice that the parameterization of time t = 1 
maps, by vector fields v G Vt ' 1 ] is a many-to-one parameterization. In other words there exist 
multiple pairs of vector fields v,w G V^ ' 1 ' such that cj>\ = <fi™ but v ^ w. Notice, however, that 
the log-likelihood term in E\ only depends on This implies that any maximizer v of E\ must 
simultaneously minimize the penalty J* \\v t \\ydt over the class of all w G yt ' 1 ] which has the same 
terminal value, i.e. = <pf. Consequently, the PMLE estimate v must be a geodesic flow. An 
important consequence is that geodesic flows {vt}te [o,i] are completely determined by the initial vector 
field %. This will become particularly important in the next section where the initial velocity field 
will be completely parameterized by n coefficient vectors. 



4 Spline representation from Euler-Lagrange 

In this section we work under the additional assumption that V is a reproducing kernel Hilbert space. 
This assumption allows one to derive the Euler-Lagrange equation for any maximizer v of which 
satisfies Q . This leads to a finite dimensional characterization of v which parallel those results found 
in the spline literature for function estimation. 

Claim 2. Let V be a reproducing kernel Hilbert space, with kernel R(x, y)Idxd, continuously embedded 
in Co(^,M d ) where Q is bounded open subset ofR d . Suppose e H ^ is a C 1 (^,IR) density on Q. Then 
any time varying vector field v G yt 0,1 ] which satisfies (ul) also satisfies the following Euler-Lagrange 
equation: 

^ n 1 n 

" E ff,t R ( x > x m) + ^ E v v r ( x > y) y=Xk t ( ? ) 



k=l k=l 



where X k , t = 4>t(X k ) and /? M = VH(X ktl )D<f>* tl (X ktt ) + VlogdetZ^pf^). 

Proof. Let E\,E2 and E3 decompose E\ as in Notice first that if h G I/' ' 1 ! and e G R then 



2E 3 {v + eh) = A|| 

^||y[o,i] H~ e2A(v, h)y [0,1] + e 2 A||/i||^ [0 x] . Therefore E%(v + eh) is different iable with 
respect to e with derivative given by 



eh)\ e=0 = f\h t 
Jo 



d e E 3 (v + eh)\= / (h u Xv t ) v dt. (8) 
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In addition, Theorem 8.10 of [24] implies that ^ +e/l (x) is different iable at e = 0. Now, the assumption 
H G C 1 ^) combined with equation (|3l|), in the Appendix, gives 



n 

d e E 2 (v + eft)| e=Q = -- E Vff(^(X fc )) • S e ^(X,)| e=0 

fc=i 

i ™ r 1 

r 1 i n 

= / (^(•),--y;{Vir(X M )D0« 1 (X fc , u )} r ii(- ) X fc)U )) du (9) 

■'O n fe=l /v 

Finally, Proposition |3j from the Appendix, implies Es(v + eh) is differentiable at e = with derivative 
given by 

d.E^v + eh)\ e=0 = -^a e logdet^ +£fc (X fc )| e=0 

k=l 

1 n fir , 

= V/ ftu-V log det Zty^ + div/i„ o0*(X fc )cZw 

= / (M0,--E{ vl °s det ^<i^ 



v . , du 

<*k,U / y 

(10) 

Remark: the above equation requires d Xi {e% ' h u (x)) = d Xi (eiR(- : x),h u )v = (eid Xj R(-,x),h u )y which 
follows since divh u G V by the assumption 1/ ^ C$(ft,M d ) (see [3 ). Now from ([9| and ( fTo| ) ? the 
energy E\ (v + eft) is differentiable with respect to e at and 

= d e E x (v + eft)| c=Q = fc) VI o,i] (11) 

where 

-j n i n 

# = " " E^(-' X M) " " E V„i?(-,I/)| I/= ^ jt (12) 

fc=l fc = l 

with /3 M = VH(X ki i)D^ 1 (X ki t) + V logdet D^X^t). Since ft G V [0 ' 1] was arbitrary, equation 
(11) implies £ v = 0, which then gives Remark: we are using the fact that the zero function in a 
reproducing kernel space is point-wise zero since the evaluation functionals are bounded. □ 

There are a few things things to note here. First, the Euler-Lagrange equation ^ only implicitly 
characterizes v since it appears on both sides of the equality (/3k, t an d Xk t t also depend on v). 
Regardless, ([7]) is useful since it implies that v must lie within a known n x d dimensional sub-space 
of V^ ' 1 '. In particular, as discussed at the end of Section pi the estimate {vt}te[o,i] is completely 
characterized by it's value at time t = 0, i.e. vo (by the geoaesic nature of v). Restricting equation 
(|7|) to t = one obtains 



1 n i n 

w = ^E 0lo R ( x > x ^ + w E v v R ( x > y) y=Xk ■ ( 13 ) 



V °""' An^ r ""' u " v ~'"""" ' An 

fe=i fc=i 



Simply stated, Vq has a finite dimensional spline characterization with spline knots set at the observa- 
tions X\, . . . , X n . Therefore to recover {vt}te[o,i] one simply needs to find the n vectors (3i t o, . . . , /? n> o 
which satisfy the following fixed point equation 

ft,o = VH(4(X k ))D4(X k ) + V log det D4(X k ) (14) 
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for all k = 1, . . . , n. 



5 Connection to Stein's Method 

The Euler-Lagrange equation given in Q has a surprising connection with a generalization of Stein's 
lemma for characterizing the normal distribution (see [19]). The main connection is that the Euler 
Lagrange equation for the PMLE estimate v t , simplified at initial time t = and terminal time 
t = 1, can be reinterpreted as an empirical version of a generalization of Stein's lemma. This is 
interesting in it's own right, however, the connection may also bear theoretical fruit for deriving 
asymptotic estimation bounds on the nonpar ametric and semiparametric estimates derived from v. 
In this section we make this connection explicit with the goal of of motivating and explaining the 
Euler-Lagrange equation for v derived above. 

To relate Vt at t = with Stein's lemma, and more generally Stein's method for distributional 



approximation, first notice that (14) implies the coefficients /3k,o> from the implicit equation (13) for 
v, satisfy /3k,o = V\og f(Xk) where / = e 110 ^ 1 \D(j>i \ is the estimated density of X using the pullback 
of the target measure with the estimated diffeomorphisms <p\. Now by computing the inner product 
of both sides of the Euler-Lagrange equation (p~3|) with any vector field u G V and applying the 



reproducing property of the kernel •) one derives 

X(vo,y) v =En{Vlogf(X)'u(X)+divu(X)}. (15) 

where E n denotes expectation with respect to the empirical measure generated by the data: ^ J2k=i $x k 
To relate with Stein first let E denote expectation with respect to the population density / = e Ho ^\D(j)\ 
given in our basic model 0. Notice that a generalization of Stein's lemma shows that if the densities 
/ and / give rise to the same probability measure then 

= E{Vlog/(X) • u(X) + divu(X)} (16) 

for all u in a large class of test functions U (see Proposition 4 in [H]). For example, a simple 
consequence of Lemma 2 in [18] implies that when / is the density of a d dimensional Gaussian 
distribution A/^(/i, 1) and X ~ A/^(/i, 1) then /i = /i implies E{ — (X — ft) • u{X) + divu(X)} = 
for any bounded function u : R d — ^ R d with bounded gradient. Stein's method, on the other hand, 
generally refers to a technique for bounding the distance between two probability measures / and / 



using bounds on departures from a characterizing equation, such as (16) for example (see [8] for an 
exposition). The bounds typically take the form 



sup 

hen 



(hf-hf) < sup|E{Vlog/(X) >u(X) +divu(X)}| (17) 

ueu 

where H and U are two class of functions related through a set of differential equations. In our case, 



applying a Holder's inequality to the Euler-Lagrange equation (15) gives a bound on right hand side 



of (17) in terms of a regularization measurement on the PMLE v and an empirical process error: 



sup|E{Vlog/(X)-^(X) + div^(X)}| < A||v ||vsup|H| v + sup|(E - E n )v fu | (18) 
ueu ueu ueu 

V v ' V v ' 

regularization at t = empirical process error 

where Vf u = Vlog/(X)^(X) + divu(X). This makes it clear that theoretical control of the PMLE 



estimate v t at time t = 0, using the Euler-Lagrange equation characterization (13), allows asymptotic 
control of the distance between the estimated density / and the true density /. 

At terminal time t = 1, there is a similar connection with Stein's lemma. In contrast to time t = 0, 
which quantifies the distance between the estimated and population densities / and /, time t = 1 
quantifies the distance between (j)(X) (the target measure) with (j)\{X) (the push forward of the true 
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population distribution though the estimated map). To make the connection, one follows the same 
line of argument as above to find that for any u G V 

X(v u u)v = K{VH(X) • u(X) + divu(X)} (19) 

where E^ denotes expectation with respect to the empirical measure ^ Yl7=i ^>f(x fe )' which is simply 
the push forward of the empirical measure ^ Ym=i ^ x k through the estimated map <j>\. Now the analog 
to (18) becomes 

sup|E^{V^(X)-^(X)+div^(X)}| < A||fii|| v sup|H|v + sup| (E* - E^) 7n | (20) 
ueu ueu ueu 

V v ' 

regularization at t = 1 

where j u = VH{X)u{X) + divu(X) and E v denotes expectation with respect to the push forward 
of the population density / = e Ho ^\D(/)\ though the estimated map <p\. Since the target measure P 
is assumed to have density e H , this bounds the distributional distance between <j)\{X) and P when 
X ~ P o 6. 



6 Nonparametric example 

In this section we utilize the finite dimensional characterization of the PMLE v at time t = 0, given 
in (13), to construct nonparametric density estimates of the form / = e Hoc ^ 1 \D(j)\\ from iid samples 



Xi,...,X n . As was discussed in the introduction, so long as the target measure P is absolutely 
continuous, the assumption that Xi, . . . , X n % ~ Po <fi encompasses all absolutely continuous measures. 
Since the class of diffeomorphisms {<j)\ : v G V^ ' 1 ]} is nonparametric, the estimate / = e Ho ^\D<p\\ is 
inherently nonparametric regardless of the choice of target probability measure P (with density e H ). 
In effect, the choice of target P specifies a shrinkage direction for the nonparametric estimate: larger 
values of A shrink / further toward the target P. In this section we illustrate the nonparametric nature 
of the density estimate /, whereas the next section explores semiparametric estimation with parametric 
models on the target P. One key feature of our methodology is the use of the Euler-Lagrange equation 
([7]) as a stopping criterion for a gradient based optimization algorithm for constructing v. In fact, 
to avoid computational challenges associated with generating geodesies with initial velocities given 
by ( [l3| , we consider a finite dimensional subclass of V^ ' 1 ] which have geodesies that are amenable 
to computation (and for which gradients are easy to compute). The key is that we use the Euler- 
Lagrange identity fl\ to measure of the richness of the subclass, within the larger infinite dimensional 
Hilbert space V I ' ^whereby allowing a dynamic choice of the approximating dimension for a target 
resolution level. 

Claim [2] shows that the PMLE vector field v G V^ 0,1 ] obeys a parametric form determined up 
to the identification of the n functions t \-> /S^t as t ranges in [0,1]. Moreover, the whole path 
of coefficients /S^t 1S determined from the initial values /3& 5 o, by the geodesic nature of v. In this 



way, we are free to optimize, over the vectors {/3i,0j • • • ? /? n ,o} c ^ using equation (13) and are 
guaranteed that the global maximum, over the full infinite dimensional space {<t>\ \ v G V^}, has 
this form. Unfortunately, deriving geodesic maps with this type of initial velocity field is challenging. 
To circumvent this difficulty we choose an approximating subclass of vector fields at time t = which 
are parametrized by the selection of N knots {fti, . . . , ft/v} C Q and N initial momentum row vectors 
{^1, • • • , Vn} cR d and have the form: 



N 



v (x) = ^2^R(x^ k ). (21) 



k=i 



The knots {fti, . . . , ft/v} need not be located at the data points {Xl, . . . , X n }. Indeed, we will see that 
alternative configurations of knots can be numerically beneficial. The key point is that vector fields 
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Figure 1: In this example we compare two different knot configurations, in (21), for generating non- 
parametric density estimates using approximate solutions to the Euler-Lagrange equation ([7]). The left 
column of images shows two different density estimates (red), based on the same data set (blue), using 
two different knot configurations (top- left uses 10 knots, bottom-left uses 30 knots). The right column 
of images show the corresponding diagnostic curves which characterize the richness of the approxi- 
mating subclass generated by the knots. The fact that the two diagnostic curves shown bottom-right 
are similar suggests that the 30 knots used generate the approximating subclass by ( [21] ) is sufficiently 
rich to reach the stationary points of the penalized log likelihood E\ given in (J3|. See Section [6] for 
details. 



at time t = 0, which satisfy (21), generate geodesies with respect to norm [J^ \\v t \\ydt\ 1 2 that are 
easy to compute. Moreover, trie variational derivatives of the terminal map <\>\ with respect to the 
initial r\ coefficients and the knots k are easily computed when utilizing similar techniques as those 
developed in [22] and pQ. This enables efficient gradient based algorithms for optimizing the PMLE 



criterion over the class generated by (21). 



As a first illustration, we show that the naive choice of initial knots obtained by setting . . . , kn} '- 
{Xi, . . . , X n } in (21 ) is not sufficient to solve (|7|); then show how it can be easily fixed using the Euler- 
Lagrange methodology. Our data set, shown with blue sticks in Figure [I] consists of n = 10 indepen- 
dent samples from a mixture of two normals, truncated so the support is [0, 1]. Our target probability 
measure P is set to the uniform distribution on [0, 1] (smoothly tapering to zero outside of [0, 1] 

for numerical convenience). For simplicity we choose the Gaussian kernel R(x,y) = exp(— ^ 2 ~^ ), 
with o~ = 0.1, to generate the RKHS V and use the penalty parameter A set to 10. The top left 
plot in Figure [l] shows the non-parametric density estimate / = e Ho( ^ Vl \D(j)W in red, generated 
by applying a gradient based optimization algorithm applied to the subclass (21) where the knots 



{fti, . . . , ft/v} = {Xi, . . . , X n } are kept fixed and the coefficients 771, . . . , t]n are optimized by mini- 
mizing the penalized log likelihood function E\(v) given in To dia gnose the richness of subclass 
(21) within the full Hilbert space we define the function V^{x) for any v G V^ ' 1 ] and any t G [0, 1] as 
follows 



k=l 



k=l 



V y R{x,y) 



(22) 



where X% t 



$(X k ) and Pl t = VH{X v k l )D(j) v tl {X v Kt ) + V logdet D^(X^). The function Av t - 
serves as a diagnostic criterion in the sense that the Hilbert norm of Xv t — gives the maximal 
rate of change of the penalized log-likelihood E\(v), within the full infinite dimensional Hilbert space 
V^ ' 1 '. In particular, 



y 

Jo 



diagnostic 



\Ut 



1/2 



sup 

IMIy[0,i] 



-1} 



E x (v + eu) 



Therefore if Xv t (x) — T)^(x) = for all t G [0,1] and x G then v satisfies the Euler-Lagrange 
equation. Discrepancies between Xv t (x) and VI when optimizing over the subclass (21) indicates the 
subclass that is insufficient rich to reach the stationary points of E\(v). The diagnostic plots in this 
example, which correspond to our density estimate shown in the upper- left image of Figure [I] are 
shown in the upper-right plot of Figure [I] where Xvo(x) is plotted in black and T>q(x) is plotted as 
a dashed green line. The large amount of discrepancy between Xvo(x) and Vq(x) indicates that the 
knots {^i, . . . , kn} = . . . , X n } are insufficient. 

To generate knots which are sufficiently rich, in this first example, we apply a discrete approxima- 
tion at initial time t = to the gradient term V y R{x, y) appearing in the Euler-Lagrange equation 
([7]) . For this approximation we use N = 3n knots in the pattern given by the following approximation 



^ n 1 n 



k=l 



Xn 



k=l 



y=x k 



^ n 1 n 



An 



k=i 



k=l 



N 



j2vk R ( x ,Kk) 



(23) 
(24) 
(25) 



k=i 



where K k = < 



'k,0 



1 

SXr, 



if k G 1 . . . n 

if k G n + 1 . . . 2n with S = 10" 
if k G 2n + 1 . . . 3n 



Xk if k G 1 . . . n 

Xk + § if fc G n + 1 . . . 2n and ^ = 
K X k - I if fc G 2ra + 1...3ra 
With this new set of knots, the resulting PMLE over the new class is show at bottom left in Figure [T] 
Notice that now the diagnostic function A^o — Vq (the difference between the black and green line in 
the bottom-right plot of Figure [T]) is much closer to zero. Indeed, for every t G [0, 1] the diagnostic 
function Xv t —T>l is similarly close to zero (not pictured). This implies that the maximal rate of change 
of the penalized log-likelihood within the infinite dimensional Hilbert space yt ' 1 ], at our estimate, is 
very small and hence our knots are sufficiently rich. 

In the previous example we used N = 3n knots in (21) to construct a sufficiently rich class 
for solving the Euler-Lagrange equation 0. Now we demonstrate that with larger data sets and 
smaller smoothness penalties one can actually use a smaller set of knots, iV«n, to approximate the 
solutions to Euler-Lagrange equation (|7|). The histograms in the left column of Figure [2] show n = 240 
iid samples from the same truncated mixture of normals used in the previous example. The resulting 
density estimates, shown in red, use a smoothness penalty set to A = 1/4. The estimate shown top- left 
utilizes N = n = 240 knots set at the data points whereas the estimate shown bottom-left uses N = 20 
knots randomly selected from the data. The right column shows the corresponding diagnostic plots 
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Figure 2: In this example we demonstrate that a small number of knots, in pi] ), can be enough 
to approximate solutions to the Euler-Lagrange equation Q. The left column of images shows two 
different density estimates (red), based on the same data set of size n = 240 (grey histogram), 
using two different knot configurations (top- left uses 240 knots, bottom-left uses 20 knots). The 
right column of images show the corresponding diagnostic curves which characterize the richness of 
the approximating subclass generated by the knots. The fact that the two diagnostic curves shown 
bottom-right are nearly identical suggests that the 20 knots used generate the approximating subclass 
by ( [21] ) is sufficiently rich to reach the stationary points of the penalized log-likelihood E\ given in 
([3]). See Section [6] for details. 

(A^o shown in black and Vq shown in green). The relative agreement of the diagnostic curves in the 
bottom-right plot suggests that 20 knots are reasonably adequate for finding approximate solutions 
to the Euler-Lagrange equation. We expect this situation to improve as the number of data points 
increase. This has the potential to dramatically decrease the computational load when applying this 
estimate to extremely large data sets. 

7 Semiparametric example 

In this section we demonstrate how the PMLE can be used to generate semiparametric estimation 
procedures obtained by assuming a parametric model on the target distribution P then introduce a 
nonparametric diffeomorphism to the target model. Indeed, any parametric model {Pq Jg0C R m } 
can be extended to a semiparametric class by considering diffeomorphisms of the data to the parametric 
target as follows: {P e o <j>\ : £ Q,v G V^ ^}. Since the model X u ...,X n ~ F e o <\>\ implies 
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<Pi (Xi ),..., (j>\{X n ) ~ P<9 it is natural to alternate the optimization of 9 and <j) to compute the 
estimates 6 and under this semiparametric model. This optimization routine is outlined explicitly 
in Algorithm [I] 

Algorithm 1 Compute the semiparametric estimates <j) 
l: Set i = and initialize ((9°, 0°). 

2: Set to the PMLE of <\> defined in Section [i] under the model X u . . . , X n ~ o 0. 
3: Set to the maximum likelihood estimate of G under the following model for the trans- 
formed data points: 

<+1 (Xi),...,0 <+1 (X n )^P 

4: If « and ^ « (/>* +1 then return (0,0) <- (0 Z+1 ,0 Z+1 ); else set z <- z + 1 and return to 
step 2. 



To illustrate the semiparametric nature of our estimate we sample from a population density which 
is a mixture of a x 2 density (with 20 degrees of freedom) and a Gaussian density (/i = 55 and a = 3) 
shown in black on the left column of plots in Figure [3j The data comprises n = 200 independent 
samples from this mixture, the histogram of which is shown on the left column of plots in Figure [3J 
We consider two different semiparametric estimates of the population density. The first uses a basic 
location-scale Gaussian family to the parametric target model {Pq : G 0} = {G^ ?CT : \i G R, a G M + } 
where G M ,cr denotes the Gaussian measure centered at \i with variance a 2 . The second example uses 
a mixture of two Gaussian measures {Fq : 6 G 0} = {ftG^^ + (1 — a)G /i2?cr2 : /ii,/i2 £ R, 0"i?0"2 £ 
M + ,0 < a < 1}. As in Section [6] we use a Gaussian reproducing kernel to generate the Hilbert space 
V . In this example, however, we use a wider kernel, with standard deviation set to half the sample 
standard deviation of the data. This is done to illustrate the flexibility in the estimated density 
obtained by simply changing the kernel width and the penalty parameter A (which is decreased 
to 1/2500 in this example). Wider kernels tend to produce estimates which have restricted local 
variability but can still have sufficient flexibility to model large amplitude variations over large spatial 
scales. 

The estimate obtained from the basic location-scale Gaussian family is shown in red in the top-left 
plot of Figure [3] Conversely, the estimated density which uses the mixture target model is shown in 
red in the bottom- left plot of Figure [3] The corresponding estimated target density dFg / dx is shown 
in green on the left two plots. To numerically approximate the PMLE initial velocity field vq, needed 
in step 2 of Algorithm [T] we used the approximating subclass for the initial velocity field given in the 
form (21) with 200 knots located at the data values. The corresponding time zero diagnostic plots 
are shown in the right column of Figure [3] (green for A^o and black for T>q). Notice that in both 
cases the semiparametric estimates do a good job at estimating the population density. In the case 
of the location-scale Gaussian target the estimated target density does a poor job of explaining the 
true density. However, the presence of a nonpar ametric diffeomorphism allows this model to fit nearly 
as well as a fit from a mixture model. Notice also that the semiparametric estimate based on the 
location-scale Gaussian target overestimates the true sampling density between the two modes. This 
seems due to the fact that the estimation procedure prefers an overly dispersed target density which 
allows the estimated diffeomorphism to effectively add mass around the smaller mode. The situation 
seem to be corrected when using a mixture. 



A Technical Details 

This section serves to present some technical details which are used in the proofs of Claim (Tland Claim 
[2] Some of these results can be found in the current literature (for example, Proposition]!] most of 
Proposition |2] and equation (31) can be found in [24 ). However, the main goal of this section is to 
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Figure 3: In this example we demonstrate that by parametrically modeling the target distribution 
one can produce flexible semiparametric density estimates. The left column of histograms show the 
data (the same histogram plotted twice) sampled from the population density shown in black. Two 
semiparametric estimates are shown in red which correspond to different parametric targets. The 
estimated target distribution is shown in green on the left column of images. The right column of 
images show the corresponding diagnostic curves which characterize the richness of the approximating 
subclass generated by the knots (A^o is plotted in black and Vq is plotted in dashed-green) . See Section 
0for details. 



establish equation ( 32 ) in Proposition [3] which is key to establishing Claim [2] We mention that all 
of the derivations presented in this section rely heavily on techniques developed by Younes, Trouve, 
Miller and co-authors (see [24] and references therein). 

To set notation let C fe (fi,R d ) denote the set of functions, mapping an open set Dctf into 
which have continuous derivatives of order < k (so that C°(^,IR d ) is the continuous functions on Q 
mapping into R d ). Also let C k (£l, W 1 ) denote the set of functions in C fe (^, M d ) whose derivatives of or- 
der < k have continuous extensions to Cl. Finally, Cq(Q, M d ) is the set of functions in C k (£l, M d ) whose 
derivatives of order < k take the value on the boundary dQ. It is a well known fact that C k (Ct, R d ) 
is a Banach space with respect to the norm: ||/||c fe (fj) = ||/IU,<x> = Ej=o su P|/3|=j SU P^ l-^/l- 

Remark: The norm \\v\\y = Jq \\vt\\ydt given in Definition [l] is technically only a semi-norm since 
one is free to change v t on a set of £ £ [0, 1] with Lebesque measure zero (and not effect the norm 
on yt ' 1 ]). This is easily fixed by identifying yC ' 1 ] with the set of equivalence classes of measurable 
functions where {^t}te[o,i] an d {w t }te[o,i] are sa id to be in the same equivalence class if \\vt — wt\\v — 
for almost every t G [0, 1]. For the remainder of the paper we treat this identification as implicit with 
the understanding that {^t}te[o,i] denotes a representer of the equivalence class to which is belongs. 
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The following proposition establishes the Hilbert space structure of V^ 0,1 ^ (stated without proof in 
8.17 of [24]). 

Proposition 1. If V is a Hilbert space with inner product (*,-)y ; then V^ ' 1 ] is a Hilbert space with 
inner product defined by (v,h) V [o,i] = (v tl ht)vdt. 

Proof. Notice first that \\v t \\v and (v t: h t )v are measurable functions of t (this can be taken to be 
implicit in definitional requirement for membership in V^ ' 1 ': that \\vt\\ydt < oo). Now, with the 
exception of completeness, all the properties of a Hilbert space inner product are inherited from (•, -)y 
and the linear properties of Lebseque integration over [0, 1]. To show completeness let v n G V^ ^ be 
a Cauchy sequence so that \\v™ — v™\\ydt — >• 0. By the completeness of V there exists a Borel set 
B C [0, 1] such that for alH G B there exists a v t G V such that \\Vf— v t \\v —> 0. On t G [0, 1]\B we are 
free to set v t = (the zero element of V). For this v t we have that \\v™— Vt\\y [0jl] = Jq \\Vf— v t \\ydt — > 0. 
Therefore l^ ' 1 ] is complete. □ 

Proposition 2. Let ^ be an open bounded subset of W d and V be a Hilbert space such that V ^ 
Cg(f},R d ). If v E V^ ^ then there exists a unique class of C 1 diffeomorphisms ofQ, {(t>t}te[o,i]> suc ^ 
that 4>t(x) G C°([0, 1] x ^,R d ) and which satisfy the ordinary differential equation d t (f)^(x) = v t (<p^(x)) 
with boundary condition 4>q{x) = x, for all x G Q. Moreover, 



■ v st (x)= f divv u (4>Ux))du. (26) 

J S 



log det Dcj) 

Proof. First note that if v G V [0 ' 1] and V ^ C^(Q,R d ) then \\v t \\ hoo < c\\v t \\ v . Now by Holder, 
Jo 1 IWIv^ < Ibllyi 01 ] < 00 so that the arguments for Theorem 8.7 in [24] to apply to the class V^ ' 1 '. 
In particular, there exists a unique class of C 1 diffeomorphisms of (/>t( x ) £ C°([0> 1] x ^,R d ), which 
satisfy the ordinary differential equation d t (f>t(x) = v t {(j)t{x)) with boundary condition (J)q(x) = x, for 
all Moreover, by Proposition 8.8 in [24] we have that 

d t D4> v st {x) = Dv t {4> v st (x))Dr st {x) (27) 

where det D(j) v ss {x) = Ida. Since D(j) v st (x) is nonsingular and differentiable in t we have that (see 
(6.5.53) of [T2] . for example) 

ftIogdet£>^ t (x) = trace^^M] -1 ^^)} 

= trace{[DC*(^)]- 1 Dft(C t W)^(^)}, by ((27) 
= trace{ J D^t(0st(^))} 
= div^(C t W)- 

Therefore log det D(jf st (x) is differentiable everywhere on t G [0, 1] with derivative given by div v t ((j) v st (x)) . 

Since v t (x) is measurable with respect to both arguments t and x (by definition) and limits of 
measurable functions are measurable, the function div^(x) is also measurable. Since (j) v st {x) is con- 
tinuous with respect to both t and x, div v t (<fi^ t (x)) is also measurable. Notice that divv t (<p^ t (x)) is 
also Lebesque integrable since \divv t ((p^ t (x))\ < c\\v t \\v by the embedding V ^ Q)(^,lR d ) and the 
fact that that jj" < ||^||y[o,i] < oo. Therefore by Theorem 7.21 of [17] we have that 



\ogdetD(/) v st (x) = / d\vv u {(j) v su {x))du 

J s 



since log det D(j) v ss (x) =0. □ 
Lemma 1. If V ^ C^(Q, R d ) and v,w G then 

Wst ~ Ctlloc < c\\v - HUn exp (cIMUu) (28) 
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where c is a constant which does not depend on v,w,s or t. Moreover, if we additionally suppose 
V M> C'i(n,R d ) then 



U v st ~ «lli,oo < \\v - HU.^flMU.n, ||«;|| v to,i]) 



(29) 



where zs a /?m£e function on R x R ; monotonically increasing in both arguments, which does 

not depend on v, w, s or t. 



Proof. The inequality (28) follows directly from Grownwell's lemma applied to the following inequality 



< 



J v u (<j> v su {x)) - w u (<j>™ u (x))du 

v u (cj> v su (x)) - v u (^ u (x))\du + f \v u (<j>™ u (x)) - w u {^ u {x))\dv 

J s 



< / 4 v u\\v\^ v su (x) - </>™ u (x)\du + c\\v ~ W\\ V [0,1] 



where the last inequality follows from the assumption V ^ Cq(Q,] 

To prove (29) notice that for any vector h G M d we have that d t D(jf st (x)h = Dv t {(j) v st {x))D(j) v st {x)h 
where D(j) v ss (x)h = h (by Proposition 8.8 in [24] and also [10]). Therefore 



D<t> v st {x)h - D<ft t (x)h 



Dv u (<p v au (x))D<p v su (x)h - Dw u (4>: u (x))D^ u (x)h 



du 



(30) 



where we are using the fact that D(jf st (x)h is differentiable with respect to t everywhere in [0, 1] and 
with Lebseque integrable derivative (and using Theorem 8.21 of [IT]). Now notice that the integrand 
of (30) satisfies 



where 



and 



\Dv u {4>l u {x))D4>: u {x)h - Dw u {<t> w su {x))D<t>™ u {x)h\ < I + II 
Dv u (4>: u (x)){Dcf>: u (x)h - D<PZ(x)h}\ < c\\v u \\ v \D<Pl u (x)h - Dcf>f u (x)h 
II = \{Dv u {4> v su { X )) - Dw u {c^ u {x))}D^ u {x)h\ 

< {||^||2,oo||C„ - CJloo + \\V U - W u \\ l oo } ||C„|| 1)00 N 



< 



[c|K|| y \\v - w\\ vl0 ,i] exp (c||w|| y[ o,i]) + c\\v u - w u \\ v j ||C«||i >00 \ h \ 



where the last inequality follows by (28). To bound II further notice ||0g t ||i >O o < ci exp (c||w|| v [o,i]). 



To see why, apply Gronwall's lemma to the following inequality 



m t (x)-€t(y)\ 



x-y+ v u ((j) v su (x)) - v u ((j) v su (y))du 



<\x-y\ 



c\\Vu\\v \€u( X ) ~ ^ V su(y)\ du 



which yields \(jf st (x) — (j) v st (y)\ < \x — y\ exp (c||i>|| v [o,i\)- Since (j) v st (x) is differentiable with respect to 
x everywhere in ^, for each multi-index /3 such that |/3| = 1 there exists a direction hP G M d (with 
\hP\ = l) such that 

\D^ st{ x)\ = lim mX + ek0) ~ +«™ < exp (c| HUll) . 

e^O e 
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Combining the above inequality with the fact that Halloo < sup^^ \x\ gives the desired inequality 
ll^atlkoo < ciexp(c llvll^o.i]). Applying this to II gives 



II < 



{c\\v u \\ v \\v - w\\ V [ ,i] exp (c||iy||y[ ,i]) + c\\v u - w u \\ v j c\ exp (c||w|| v [o,i])|ft| 



Therefore 



J II du < cic|ft|||i; - ^||y[o,i]|||v||y[ ,i] exp(2c||iy||y [ o J i]) + exp(c|H| y[ o,i]) | 
= \h\\\v - w\\ V [o,i]F(\\v\\ V [o,i},\\w\\ V [o,i]) 



where F(x,y) is monotone and finite in both x and y. Now by equation (30) we have that 
\Dcj) v st (x)h - D(j^ t (x)h\ < I Idu+ f II du 

J s J s 

+ \h\\\v - w\\ V [o,i]F(\\v\\ V [o,i], |H|y[o,i]). 

By Gronwell's lemma we have that 

\D(j) v st {x)h - D4>™ t {x)h\ < \h\\\v - w\\ v[ o,i]F(\\v\\ v[ o,i] , |H| y[ o,i]) exp(c||v|| y[0>1] ). 



Now by taking a supremum over x G \h\ = 1 and combining with ( |28| ) gives ( |29| ), after redefining 
F to accommodate the extra term exp(c||v||^ [0>1] ). 

□ 



Proposition 3. Ifv,he V [0 ' 1] and V ^ (ft, R d ) then for all x e Q and s, t G [0, 1] 



If, in addition, V ^ Co(ft,R d ) then 

d e logdetD^ h (x)\ e=0 = f 

Jo 



v+eh 
ut 



h u }o<t>l+* h (x)du. 



h u ■ V log det D<\> v u \ + div h u o § v u (x) du 



(31) 



(32) 



Proof. The assumption that y,/iG yt ' 1 ] and V ^ Co(f2,R d ) are sufficient to apply Theorem 8.10 of 
[24] which gives 



^rW| e=0 = / {D4>lthu}o4>Ux)du. 



Now since d e <j) v s f eh (x) = ^0^ e/?,+ ^(x)|^ o one immediately obtains (31). 

To show (32) notice that partial derivatives on x can pass under trie integral in (31) to compute 
£><9 e (/>i +e/ \ This follows by first noticing that V ^ C^(ft,R d ) implies 



sup \\Kt eh \\ 2oo <ciexp(c 2 |M| y[0 ,i] +M||ft|| v[ o,i]) 
ue[o,i] 



(33) 



for all |e| < M, by equation (8.11) of [24]. Therefore when fixing v,he F [0 ' 1] the function \\D(/)l+ eh h u \\ hc 
is bounded above by a finite constant over (u,e) G [0,1] x (— M, M). With an additional appli- 
cation of Proposition 8.4 of [24 we also have that ^{Dcj^^hy} o (/^^l^ < oo uniformly over 
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(u,e) G [0,1] x (— M, M). Therefore, indeed, partial derivatives on x can pass under the integral in 
( |3l) ) to obtain 

Dd t 4>l +th {x)= I D[{D4> v u r h h u }o4,l+^{x)\du. (34) 
Jo 

Now we show that Dd e (j)\ +eh {x) is continuous over (x, e) G £1 x (— M, M). This will allow us to 
switch the order of D and d e and establish (1321). The same reasoning which allows D to pass under 



the integral in (31) also allows us to pass limits on x and e under the integral in (34). Therefore 
it will be sufficient to show the integrand, D [{Z}(/>^j~ e/l ft n } o ^ +e/l (x)], in (34) is continuous over 
(x, e) G ^ x (— M, M). To see why the integrand in (34) is continuous first note that (j) v s f eh is 
a C 2 diffeomorphism (by a similar proof Theorem 8.7 in [24]). Secondly, under the assumption 
V ^ Co(^,M d ) one can extend (29) to bound \\<fist~ eh — ^t^lkoo by c|£— e|, where c is a finite constant 
which may depend on v, ft but not on £, e. These two facts imply that D [{D(j)^ eh h u } o ^ +e/l (x)] is 
indeed continuous in (x, e) G ^ x (— M, M) which implies that Dd e (j)\ +eh (x) is also. 

The continuity of D<9 e <^ +e/l (x) over (x, e) G fix (-M, M) implies d e D(/) v + eh exists and Dd e (/) v s ? eh = 
d e D(j) v s ^ eh (see [9], page 56). Then since D(j) v s ^ eh is nonsingular (by the diffeomorphic property) and 
different iable with respect to e we have that 



d e \ogdet D(t> v s t eh {x) 



tmce{[D^\x)]-%D^ h (x)} 
= t^ce{[D^t e \x)}- 1 Dd^T\^)}- 



Therefore, by (34), 



d e \ogdetD^ eh (x)\ e=0 = trace | [D^(x)] 1 f\[{D^ ul h u } o ^ n o ^(x)]^J 



trace 



| [£>#(*)] _1 j D[{D<FM ° 
/ trace {£>[{£>^ 1 / lu }o^ u ( I/ )] Ida 



y=0»(x) 



duD4>l{x) 



Now notice that 



trace D[{D^ ul K} o = trace [{23(23^^)} o fl^WiJ] 

= trace [{DWyj}^)" 1 ] o ^ n 

= (ft, o (V log det D(j) v ul ) o 0; t4 ) d + (div h u ) o 

The last line follows from the identity: trace [{D [D(j) v ul h u ] }{D(j) v ul ) _1 ] = (ft u , Vlogdet D(j) v ul ) d 
trace(Z3ft n ). Therefore 

a e logdetD^ +e/l (x)| e=0 = / [ft u . Vlogdet D(j) v ul + div ft J o <j> v u {x)du. 

6 Jo L J 



□ 
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